Knowledge Engineering Report:

An  Expert  System  for  Selecting  Reliability  Index 

Zhongmin Li

University  of  Southern  California 
1.  Introduction 

This report documents the knowledge encoded in the Reliability Index Knowledge
Base (RIKB). The knowledge is presented as conceptualizations of the judgmental
knowledge used at the various levels of decision making during a consultation.

We used a declarative knowledge representation mechanism to encode the
knowledge necessary for selecting a reliability index. In a declarative
representation of knowledge, the basic constructs of a knowledge base are
production rules and attribute-value pairs. In RIKB, attributes represent the
properties and characteristics of reliability indexes that affect the decisions
made in selecting reliability indexes for a given Criterion-Referenced Test. The
value specifies the nature of the attribute in a particular situation
(decision-making point). For example, INTENDED-USE (of test score) is an
attribute, and its value could be decision, description, or program-evaluation.
In the following description, attributes are written in upper case and values in
lower case. An attribute value can either be derived from rules or obtained
directly from user input. The form rule-i, where i is a number, indicates which
rule is used to determine the attribute value. The form question-i, where i is
also a number, indicates which question is asked to obtain the information for
the attribute. Refer to the knowledge base listing for the exact wording of the
rules and questions.

The  following  sections  are  organized  based  on  the  four  phases  in  selecting 
reliability  index:  (a)  determining  reliability  category,  (b)  determining  reliability 
index,  (c)  determining  reliability  design,  and  (d)  reporting  consultation  results. 

2.  Determining  Reliability  Category 

This section describes the knowledge related to determining which reliability
category is suitable for a given Criterion-Referenced Test. This phase includes
five tasks: (a) querying how knowledgeable the user is, (b) querying how
important the test-score use is, (c) determining the intended use of the test
score, (d) determining the subcategory of test score use, and (e) determining
the reliability category. Although the information gathered in the first two
tasks does not contribute to the reasoning in tasks (c), (d), and (e), these
tasks serve as a front end to the knowledge system. The information will be used
later during the consultation.

The information is collected up front to represent the natural flow of the
consultation. Rule-3 controls the order of the tasks. For the purpose of
scoping, rule-3 also screens the cases that can be handled by the system.

2.1.  Determining  How  Knowledgeable  is  The  User 

A user's knowledge about measurement in general, and about reliability indexes
in particular, provides important information about how detailed the expert
system's recommendation should be and which consultation mode the system will
use.[1] This information is represented by the USER-KNOWLEDGEABLE attribute,
which can be either "yes" or "no". Figure 2.1 presents the decision tree for
determining the USER-KNOWLEDGEABLE attribute.
Figure 2.1. Determining USER-KNOWLEDGEABLE.


[1] In the first knowledge engineering session, Dr. Hambleton indicated that if a user is not knowledgeable
about test measurement, he will recommend a simple reliability index and a related test design. Also, the
consultation will be directed to tell the user what to do. For a knowledgeable user, he will recommend
more statistically powerful indexes and also provide several options to the user.
The decision is based on a taxonomy of possible users (USER_OCCUPATION
attribute), which consists of four categories: (a) classroom teacher, (b)
district level test maker, (c) state level test maker, and (d) test
publisher.[1] In the current implementation, we treat categories (b), (c), and
(d) as if they were the same; it is assumed that a user from these three
categories is knowledgeable about test measurement. If a user is a classroom
teacher, the expert system uses question-9 (KNOWS_CONCEPT_OF_RELIABILITY
attribute) to ask whether the user is knowledgeable about test measurement. The
default value is that the user has little knowledge about test measurement
(rule-29 will inform the user if the default value is used).
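
The occupation-based decision described above can be sketched in a few lines of Python. This is only an illustration and not the RIKB rule code: the function name `user_knowledgeable` and the string values are assumptions chosen to mirror the attribute names used in this report.

```python
from typing import Optional

# Occupations that the current implementation treats as knowledgeable
# (categories b, c, and d of the USER_OCCUPATION taxonomy).
KNOWLEDGEABLE_OCCUPATIONS = {
    "district level test maker",
    "state level test maker",
    "test publisher",
}


def user_knowledgeable(occupation: str,
                       knows_concept_of_reliability: Optional[str] = None) -> str:
    """Return "yes" or "no" for the USER-KNOWLEDGEABLE attribute.

    `knows_concept_of_reliability` stands for the answer to question-9
    ("yes", "no", "unknown", or None when the question was not asked).
    """
    if occupation in KNOWLEDGEABLE_OCCUPATIONS:
        return "yes"
    if occupation == "classroom teacher":
        if knows_concept_of_reliability in ("yes", "no"):
            return knows_concept_of_reliability
        # Default: little knowledge about test measurement
        # (the report says rule-29 informs the user of this default).
        return "no"
    raise ValueError(f"unknown occupation: {occupation!r}")
```

For example, `user_knowledgeable("classroom teacher", "unknown")` falls back to the default value "no".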

2.2.  Query  Importance  of  Test  Score 

The importance of test score use is judged by what kinds of decisions will be
made from the test results. Thus, the importance of the score is represented by
the IMPORTANCE_OF_RESULT attribute, and how the test results will be used is
represented by the HOW_TEST_USED attribute. Figure 2.2 shows a decision table
for determining the IMPORTANCE_OF_RESULT attribute.
Figure 2.2. Determining IMPORTANCE_OF_RESULT (from question-1).


This decision table provides a possible scale for judging the importance of
test results.[2] In the "longer-term decisions" category, there are several
subcategories, such as assigning students to special programs, assigning
mid-term grades, and assigning final grades. This information is included in
question-1.
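
Because the numbers in the decision table only express an order of importance, the table can be approximated by an ordinal rank. The sketch below is an assumption-laden illustration: the rank values 1 to 4 and the function name `importance_of_result` are hypothetical, and the order follows the listing of options in question-1.

```python
# Test uses in assumed ascending order of importance
# (the order in which question-1 lists them).
HOW_TEST_USED_ORDER = [
    "day-to-day classroom management",
    "forming instructional groups",
    "longer-term decisions",
    "credentialing exams",
]


def importance_of_result(how_test_used: str) -> int:
    """Return an ordinal rank (1 = least important) for IMPORTANCE_OF_RESULT.

    The rank carries order information only; the magnitudes are arbitrary.
    """
    return HOW_TEST_USED_ORDER.index(how_test_used) + 1
```

Only comparisons between ranks are meaningful, for example that credentialing exams rank above longer-term decisions.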


[1] Berk (1984) classified three types of CRT practitioners: (a) classroom teachers, (b) district and state
level test makers, and (c) test publishers. We separate item (b) so that the taxonomy contains four
categories, in anticipation that different treatments might be necessary for district and state level test
makers.

[2] The taxonomy of how the test score will be used is based on Hambleton's background notes for the project
(December 22, 1987, p.6-). However, the notes only provide the order of importance of the test results.
Therefore, the numbers in the table serve only to represent the order of importance.
2.3. Determining Intended Use of Test Score

The intended use of the test score (USE_CATEGORY) falls into one of three
categories: (a) decision, (b) description, and (c) program evaluation
(Hambleton, 1987). Figure 2.3 shows the decision tree used to determine the
intended use of the test result (USE-CATEGORY attribute).


Although users generally know the intended use of the test score, they may
still have difficulty selecting a category because they are unfamiliar with the
definitions of the categories or with this classification system. Thus, this
part of the consultation is designed to provide two levels of assistance. At
the first level, a definition of each category is presented when the users are
asked to select an intended test score use. If the users still have difficulty,
they can type in "unknown", and the system will enter a lower-level query mode
to provide more assistance.

In the current implementation, there is one question corresponding to each of
the three test score uses.[1] A user's confirmation of a question leads to the
conclusion of the test score use corresponding to that question. The knowledge
coded in rule-10 to rule-12 ensures that there will be only one test score
use[2] (USE-CATEGORY is single-valued).
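
The two levels of assistance can be sketched as follows. This is a hedged illustration rather than the system's rule code: the function name `use_category` is hypothetical, the pairing of each lower-level question with a test score use is supplied by the caller because this report does not state it, and returning `None` when no question is confirmed is an assumption.

```python
from typing import Iterable, Optional, Tuple


def use_category(user_selected_use: str,
                 confirmations: Iterable[Tuple[str, str]] = ()) -> Optional[str]:
    """Return a single value for the USE-CATEGORY attribute.

    user_selected_use: the first-level answer, or "unknown".
    confirmations: ordered (use, answer) pairs from the lower-level
        yes/no questions, one question per test score use.
    """
    # First level: the user picks a category directly from the definitions.
    if user_selected_use != "unknown":
        return user_selected_use
    # Lower level: the first confirmed question decides the category,
    # which keeps USE-CATEGORY single-valued.
    for use, answer in confirmations:
        if answer == "yes":
            return use
    # Assumption: nothing can be concluded if no question is confirmed.
    return None
```

Stopping at the first confirmed question is one simple way to guarantee a single value; the actual rules rule-10 to rule-12 may enforce this differently.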

2.4. Determining Subcategory of Test Use

There are two different situations in which reliability information is valued:
(a) in the test development process, and (b) as one of the criteria used to
evaluate an intended test use (Hambleton, 1988). USE_SUBCATEGORY is the
attribute that represents this information. Figure 2.4 is the decision tree
used to determine the value of the attribute.

Of the two subcategories of intended test use, the default is "evaluation of
intended test use" because it represents most cases in which reliability
information is assessed. Also, the current version of the system can consult
only on the "evaluation of intended test use" subcategory.[3] Therefore, if a
user does not know the subcategory of intended test use, the system assumes
that it is "evaluation of intended test use".


Figure 2.4. Determining USE-SUBCATEGORY.


[1] The three questions for determining intended test use were elicited during the first knowledge
engineering session with Dr. Hambleton.

[2] Although it was discussed in the knowledge engineering session that there might be multiple test
score uses, the prototype expert system assumes that only one USE-CATEGORY is appropriate for a given
test.

[3] It is not yet clear how using the test score in the test development process differs from
using the score for evaluation of an intended test use. Since the concept of using the test score in the
test development process is relatively new, it is not included in the prototype system.
2.5. Determining Reliability Category

Berk (1984) groups reliability indexes for Criterion-Referenced Tests into
three categories: (a) threshold loss, (b) squared-error loss, and (c) domain
score estimation. This categorization is used to limit the search space for
selecting reliability indexes. Hambleton (1988) proposes the test use
categories (a) decisions, (b) descriptions, and (c) program evaluation. The
relationship between the two categorizations is straightforward. The
descriptive use of a test corresponds to the domain score estimation category.
The decision category corresponds to the threshold loss and squared-error loss
function categories. The correspondence for the program evaluation use of a
test has not yet been elicited.[1]

Figure 2.5 shows a decision tree that concludes the RELIABILITY_CATEGORY
attribute.


Figure 2.5. Determining RELIABILITY_CATEGORY.


Among the three test uses, the decision use requires special attention because
it covers two reliability categories: threshold loss and squared-error loss.
The choice is based on whether the losses associated with decision errors are
equally serious (threshold loss) or not equally serious

[1] Since it is still not clear which reliability indexes are suitable for the program evaluation use of a
test, the prototype system will not handle consultation on this category. Therefore, no further effort has
been spent on eliciting related knowledge.

[2] In Berk (1984), three characteristics of reliability categories are provided (table 9.1, p. 236). They
are (a) score interpretation, (b) type of decision or information required for decision, and (c) losses
associated with decision errors. Each of these can be used to distinguish the threshold loss category from
the squared-error loss category. We used (c) in the system, based on the knowledge elicited at the second
knowledge engineering session with Dr. Hambleton. Do we need to consider the other two characteristics?
(squared-error loss).[2] The default reliability category for the decision
category is threshold loss because it is the most commonly used.
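
The mapping from test use to reliability category can be summarized in a short sketch. It is an illustration under stated assumptions: the function name `reliability_category` is hypothetical, the second argument stands for the answer to question-7 (SERIOUSNESS_OF_MISCLASSIFICATION), and `None` is used here to signal that program evaluation is not handled by the prototype.

```python
from typing import Optional


def reliability_category(use_category: str,
                         misclassifications_equal: str = "unknown") -> Optional[str]:
    """Return the RELIABILITY_CATEGORY for a given intended test use.

    misclassifications_equal: "yes", "no", or "unknown", i.e. whether all
    misclassifications are approximately equal in their impact.
    """
    if use_category == "description":
        return "domain score estimation"
    if use_category == "decision":
        if misclassifications_equal == "no":
            return "squared-error loss"
        # "yes", or the default when the user does not know.
        return "threshold loss"
    # Program evaluation: no correspondence has been elicited yet.
    return None
```

Treating "unknown" the same as "yes" reflects the stated default of threshold loss for the decision category.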

3.  Determining  Reliability  Index 

This section describes the knowledge coded to determine the reliability index
suitable for a given test use. The decision is based on the reliability
categories. Since the "program evaluation" use of the test score does not yet
have a corresponding reliability category, it is not considered in the decision
making included in this phase.

3.1.  Reliability  Index  for  Squared-Error  Loss  Function 

The squared-error loss function deals with the consistency of measurements or
test scores (Berk, 1984). The decision involved in determining the reliability
index for the squared-error loss function is very simple. So far, only one
index is required for the function: the "standard error of measurement".[1]
Therefore, there is a one-to-one mapping[2] from the intended use of the test
to the reliability index category. This mapping is coded in rule-30.

3.2.  Reliability  Index  for  Threshold  Loss  Function 

The threshold loss function focuses on the consistency of the classification of
students as masters and non-masters of an instructional objective based on a
threshold or cut-off score. There are two types of reliability indexes: (a)
decision consistency, and (b) kappa. The decision consistency estimate is
relatively easy to compute and interpret. It is recommended for use in
essentially every case. Kappa "provides estimate of level of agreement
corrected for chance" (Hambleton, 1988), but it is harder to interpret than
decision consistency. Therefore, it usually serves as an add-on reliability
statistic that provides more information beyond decision consistency for
important tests. The decision about which index to recommend depends on how
important the test is, which is represented by the HOW_TEST_USED attribute. As
shown in figure 2.6, credentialing exams are very important; thus, both
decision consistency and kappa are recommended. On the other hand, day-to-day
classroom management and forming instructional groups

[1] Berk (1984) lists two indexes under the squared-error loss function category. Both of them belong to
the general category, standard error of measurement.

[2] In the second knowledge engineering session, we also discussed other indexes for the squared-error loss
function. However, none of them is widely used in practice. In this prototype system, only the standard
error of measurement statistic is considered for the squared-error loss function.
are less important. Thus, only decision consistency is recommended for these two
types of test use. The situation for longer-term decisions is a little more
complex. It involves two attributes: (a) HISTORICAL-DATA, and (b)
USER-KNOWLEDGEABLE.

Figure 2.6. Selecting RELIABILITY-INDEX.


The decision is that if the user knows that reliability data have been
collected for the test, the system recommends the same indexes as the ones used
before. Otherwise, information on the user's knowledge about test measurement
is used to determine the reliability index. For a knowledgeable user, the
system recommends both decision consistency and kappa. Otherwise, only the
decision consistency estimate is recommended.[1]
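
The selection logic for the threshold loss category can be sketched as below. This is a minimal illustration, not the system's rules: the function name `threshold_loss_indexes` and its parameter names are hypothetical, and `previous_indexes` stands for the indexes the user reports from earlier reliability data (empty or `None` when no such data are known).

```python
from typing import List, Optional, Sequence

DECISION_CONSISTENCY = "decision consistency"
KAPPA = "kappa"


def threshold_loss_indexes(how_test_used: str,
                           previous_indexes: Optional[Sequence[str]] = None,
                           user_knowledgeable: str = "no") -> List[str]:
    """Return the recommended reliability indexes for threshold loss."""
    if how_test_used == "credentialing exams":
        # Very important tests: recommend both indexes.
        return [DECISION_CONSISTENCY, KAPPA]
    if how_test_used in ("day-to-day classroom management",
                         "forming instructional groups"):
        # Less important tests: decision consistency only.
        return [DECISION_CONSISTENCY]
    if how_test_used == "longer-term decisions":
        if previous_indexes:
            # Reliability data exist: reuse the same indexes as before.
            return list(previous_indexes)
        if user_knowledgeable == "yes":
            return [DECISION_CONSISTENCY, KAPPA]
        return [DECISION_CONSISTENCY]
    raise ValueError(f"unknown test use: {how_test_used!r}")
```

For a longer-term decision with no historical data, the result depends only on USER-KNOWLEDGEABLE.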


[1] These rules were elicited from the second knowledge engineering session with Dr. Hambleton. The logic is
still too simple here and may not fully represent the knowledge used by a domain expert in decision making.
More elicitation is needed to expand the system further.
4. Determining Reliability Design


This section documents the knowledge for selecting which reliability design is
appropriate for a given test. Rule-34 is the control rule for determining the
reliability design.

4.1. Reliability Design, Number of Administrations, and Number of Forms

There are four possible reliability designs: (a) test-retest with equivalent
forms, (b) test-retest with same form, (c) single administration with one form,
and (d) single administration with parallel forms.[1] Which reliability design
is appropriate depends on the number of test administrations that a user is
willing to give and the number of test forms available. Figure 2.7 presents a
decision table for recommending a reliability design.


Figure 2.7. Determining Reliability Design.

In the figure, an entry "one" means that only one form is available or only one
administration is possible. "More than one" means that multiple forms are
available or that two test administrations are possible. Assuming the
information on the number of administrations and forms is available, each of
the four rules concludes one particular reliability design. These rules make
the relationships between reliability designs and the number of test
administrations and forms clearer for the system's explanation purposes.
Lengthy consultation might be required to determine the number of possible test
forms and test administrations, which is described in later sections.
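
A decision table of this kind maps naturally onto a dictionary lookup. The sketch below is illustrative only: the names `DESIGN_TABLE` and `reliability_design` are hypothetical, and the rule numbers are taken from the decision table of Figure 2.7.

```python
from typing import Dict, Tuple

# Key: (number of administrations, number of forms).
# Value: (reliability design, rule that concludes it).
DESIGN_TABLE: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("more than one", "more than one"): ("test-retest with equivalent forms", "rule-35"),
    ("more than one", "one"): ("test-retest with same form", "rule-36"),
    ("one", "one"): ("single administration with one form", "rule-37"),
    ("one", "more than one"): ("single administration with parallel forms", "rule-38"),
}


def reliability_design(number_of_administrations: str,
                       number_of_forms: str) -> Tuple[str, str]:
    """Look up the recommended design and the rule that produced it."""
    return DESIGN_TABLE[(number_of_administrations, number_of_forms)]
```

Returning the rule name with the design keeps the link between a conclusion and its rule visible, which is the explanation benefit the text attributes to having four separate rules.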

4.2. Determining Number of Test Administrations

If the reliability index selected is "standard error of measurement", only one
test administration is required, as coded in rule-39. This applies to both

                     Number of Administrations
Number of Forms      one                           more than one
---------------      --------------------------    --------------------------
one                  single administration         test-retest
                     with same form (rule-37)      with same form (rule-36)

more than one        single administration with    test-retest with
                     parallel forms (rule-38)      equivalent forms (rule-35)

[1] Hambleton (1988) provides three possible reliability designs for a CRT: (a) test-retest with the
same form, (b) test-retest with equivalent forms, and (c) single administration. We further split the
single administration item into two designs based on the number of forms required for the design.
squared-error loss and domain score estimate categories.[1] However, if the
reliability index selected is "decision consistency estimate" or "kappa", then
the number of possible test administrations could be either one or two. Thus,
the number of test administrations required for the "threshold loss" category
depends on the reliability indices selected, the nature of the material to be
tested, and other administrative considerations. The following discussion
focuses on the knowledge for determining the possible number of test
administrations.

Selecting the number of test administrations consists of two steps: (a)
querying about the possibility of multiple test administrations, and (b) asking
the user to confirm multiple administration if it is recommended in step (a).
The key attribute for step (a) is MULTIPLE_ADMINISTRATION_POSSIBLE, and for
step (b) it is USER_CONFIRMED_MULTIPLE_ADMINISTRATION.

Figure  2.8  shows  the  decision  tree  for  determining  whether  multiple 
administration  of  test  is  possible. 


Figure 2.8. Determining MULTIPLE_ADMINISTRATION_POSSIBLE.


[1] Berk (1984) summarizes two indices for the squared-error loss category. Both are based on the
standard error of measurement (table 9.3, p. 248-249). For the domain score estimate category, three of
the six indices listed in table 9.4 (p. 253-255) are labeled as standard error of measurement.
In this decision tree, the system first asks the user to specify whether
multiple administration of the test is possible (question-12). If the user
answers "yes" or "no", the value is passed to the
MULTIPLE_ADMINISTRATION_POSSIBLE attribute. However, if the user answers
"unknown", the decision is based on the user's responses to two lower-level
questions: whether students may remember the test items (question-13) and
whether learning may occur if the test is administered twice (question-14).[1]
For both questions, an "unknown" response is assumed to be "no", but
appropriate messages[2] are displayed to inform the user of the system's
assumption.

For credentialing examinations, it is impossible to administer the tests more
than once. This knowledge is implemented in rule-44. In this case, the decision
tree shown in figure 2.8 is not invoked.
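
The decision just described can be sketched as follows. Two points are assumptions rather than statements from this report: the names are hypothetical, and the sketch supposes that multiple administration is ruled out when either lower-level question is answered "yes" (students may remember the items, or learning may occur between administrations).

```python
def multiple_administration_possible(how_test_used: str,
                                     user_answer: str = "unknown",
                                     student_may_remember: str = "unknown",
                                     learning_may_occur: str = "unknown") -> str:
    """Return "yes" or "no" for MULTIPLE_ADMINISTRATION_POSSIBLE."""
    # Credentialing exams cannot be administered more than once (rule-44),
    # so the question tree is skipped entirely.
    if how_test_used == "credentialing exams":
        return "no"
    # A direct "yes"/"no" answer to question-12 is passed through.
    if user_answer in ("yes", "no"):
        return user_answer
    # Lower-level questions 13 and 14; "unknown" is treated as "no".
    remembers = student_may_remember == "yes"
    learns = learning_may_occur == "yes"
    # Assumption: either effect makes a second administration unsuitable.
    return "no" if (remembers or learns) else "yes"
```

With this reading, a user who answers "unknown" to all three questions ends up with "yes", because both lower-level answers default to "no".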

Once it is known whether it is possible to administer a test more than once,
this information is used in step (b) to determine the number of test
administrations for the test. Figure 2.9 shows the decision table.


RELIABILITY_CATEGORY = threshold loss

multiple_administration_possible          yes             yes       yes        no
user_confirmed_multiple_administration    yes             no        unknown    N/A
number_of_administration                  more than one   one       one        one
rules used                                rule-40         rule-41   rule-42*   rule-43

* Rule-42 contains the message that assumes no multiple test administration.


Figure 2.9. Determining NUMBER_OF_ADMINISTRATION.

To ensure flexibility, the system asks users for confirmation when multiple
test administrations are possible. The users may either confirm or reject the
system's recommendation. If the users answer "unknown", then single
administration is suggested.
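
The confirmation step for the threshold loss category can be written as a small function. This is a sketch under assumptions: the function name `number_of_administrations` is hypothetical, and the rule numbers follow the decision table of Figure 2.9.

```python
from typing import Optional, Tuple


def number_of_administrations(multiple_possible: str,
                              user_confirmed: Optional[str] = None) -> Tuple[str, str]:
    """Return (NUMBER_OF_ADMINISTRATION, rule used) for threshold loss.

    multiple_possible: "yes" or "no".
    user_confirmed: "yes", "no", or "unknown"; ignored when multiple
        administration is not possible.
    """
    if multiple_possible == "yes":
        if user_confirmed == "yes":
            return ("more than one", "rule-40")
        if user_confirmed == "no":
            return ("one", "rule-41")
        # "unknown": assume no multiple administration and say so.
        return ("one", "rule-42")
    # The confirmation question does not apply.
    return ("one", "rule-43")
```

Only an explicit confirmation leads to more than one administration; every other path falls back to a single administration.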


[1] The questions need not be limited to two. The code structure is flexible enough to include other
knowledge necessary to help a user determine whether multiple administration of a test is possible in a
given situation. The two questions coded in the current version of the system are from the second
knowledge engineering session with Dr. Hambleton.

[2] This is why some boxes in figure 2.8 contain more than one rule. Rule-49 and rule-51 contain the
messages that inform the users about the system's assumption when an "unknown" response is encountered.
4.3. Determining Number of Forms

Parallel  forms  are  required  if  the  reliability  index  category  is  either 
"squared-error  loss"  or  "domain  score  estimate".  Figure  2.9  shows  a  decision  tree 
for  determining  number  of  forms  for  the  two  index  categories  mentioned. 


RELIABILITY_CATEGORY = squared-error loss or domain score estimate

PARALLEL_FORM_AVAILABLE (question-16)
    yes      ->  number_of_form = more than one (rule-52)
    no       ->  design forms*  ->  number_of_form = more than one (rule-53)
    unknown  ->  assume "no"  ->  design forms*  ->  number_of_form = more than one (rule-54)

* "design forms" may invoke the rules to help users design parallel forms.[1]

Figure 2.9. Determining NUMBER_OF_FORM for "Squared-error Loss" or
"Domain Score Estimate" Categories


Rule-52, 53, and 54 all conclude that multiple forms are required. However,
rule-53 and 54 allow the further addition of knowledge regarding form design,
such as dividing an existing form or recommending strategies for test length
selection and item determination.
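
For these two categories the conclusion is always the same, and only the rule and the need for form design differ. A minimal sketch, assuming a hypothetical function name `number_of_forms_parallel_required` and a boolean flag that marks where form-design assistance could later be attached:

```python
from typing import Tuple

# Answer to question-16 -> rule that fires.
_FORM_RULES = {"yes": "rule-52", "no": "rule-53", "unknown": "rule-54"}


def number_of_forms_parallel_required(parallel_form_available: str) -> Tuple[str, str, bool]:
    """Return (NUMBER_OF_FORM, rule used, needs_form_design).

    Applies to the squared-error loss and domain score estimate categories.
    """
    rule = _FORM_RULES[parallel_form_available]
    # "unknown" is treated as "no": forms still have to be designed.
    needs_form_design = parallel_form_available != "yes"
    return ("more than one", rule, needs_form_design)
```

Keeping three separate rules, even though they share a conclusion, leaves room to extend the "no" and "unknown" branches independently.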

For the "threshold loss" category, the matter becomes a bit more complex
because either a single form or multiple forms can be used to assess
reliability indices. Figure 2.10 presents a decision tree for this category.
The line of reasoning depends on whether parallel forms are available. "If
parallel forms are available, the design for a reliability study is probably
clear".* However, if the parallel forms are available but both forms cannot be
administered, modifications to the design are needed (Hambleton, 1988, p. 6).


[1] This step is included in rule-53 or 54. It is designed for further expansion of the knowledge base to
provide assistance in determining form length or test length. Right now, it does not do anything.

* I interpret this as: if the parallel forms are available, then recommend the use of the parallel forms.
This interpretation is simplified, and more knowledge needs to be elicited to make the reasoning more
complex.
[Figure 2.10 decision tree; the branch layout was not recoverable. Legible nodes:]

PARALLEL_FORMS_AVAILABLE (question-16)
NUMBER_OF_ADMINISTRATION (section 4.2), one node on each branch
HOW_RELIABILITY_ASSESSED (question-17)
    with all objectives     ->  number_of_form = more than one (rule-56)
    with sample objectives  ->  number_of_form = one (rule-57)
number_of_form* (rule-58)
design form*  ->  number_of_form (rule-59)

* It is not clear yet whether the system should assume single form here or query for
possible parallel forms.

Figure 2.10. Determining NUMBER_OF_FORMS for "Threshold Loss" Category

[Two possible designs are:] (a) creating several forms, where each form includes
parallel items to measure a fraction of the objectives; in this way, the reliability of all
objective scores can be assessed but with a fraction of the total examinee pool, (and) (b)
assessing the reliability of a representative sampling of objectives and then generalizing
the findings to describe all objectives. With this second design, single-administration
estimates of reliability can be computed for the remaining objectives. (p. 6)

This information is represented by the attribute HOW_RELIABILITY_ASSESSED
(question-17). In the current implementation, if parallel forms are not
available and multiple administration is possible, rule-59 concludes that a
single form should be used. Later on, we should replace this rule with another
line of reasoning, in which the system starts a discussion with the users about
the possibility of creating parallel forms from the existing single form.[1]


[1] We discussed this in both knowledge engineering sessions with Dr. Hambleton. However, some
knowledge still needs to be elicited to extend rule-59.
5. Reporting Consultation Results


Hambleton (1988, p. 8) listed ten statistics that should be reported in any
comprehensive reliability study: (a) reliability index, (b) cut-off score, (c)
sample size, (d) descriptive information about the sample, (e) test length, (f)
test score mean and standard deviation, (g) test score distribution, (h) if
decisions are made on two occasions, the percent of examinees, and (i) standard
errors. However, the judgmental knowledge involved is still not clear: which
statistics must be reported, which statistics may be reported, and what the
relationship is between the reported statistics and the importance of test
uses.

In addition, based on Berk's chapter (1984), we intend to add some new items to
the report, such as the definition of each recommended index, the original
sources of these indices, and the advantages and disadvantages of these indices.
Appendix A

Questions  Generated  by  the  Reliability  Index  Expert  System 

This appendix lists all the questions generated by the Reliability Index Expert
System. All questions in the system are asked in multiple-choice format.
Therefore, the possible responses are listed following the text of each
question. Each entry is labeled Question-i, where i is a number. The name of
the attribute represented by the question is enclosed in parentheses after the
label.

Question-1 (HOW_TEST_USED):

A Criterion-Referenced Test may be used for many purposes. It can be used
for day-to-day classroom management, forming instructional groups,
longer-term decisions, and credentialing exams. Some examples of
longer-term decisions are assigning students to special programs and
assigning mid-term and final grades.

Which of the following categorizes your use of the test?

1.  day-to-day  classroom  management 

2.  forming  instructional  groups 

3.  longer-term  decisions 
4. credentialing exams

Question-2  (USER_SELECTED_USE): 

There  are  three  major  categories  of  CRT  score  uses: 

DESCRIPTIONS - We make statements such as the student is performing at
an 80% level in the domain of content of interest. Such statements are often
made for each objective measured in a CRT.

DECISIONS - We often desire to assign examinees to two or more mastery
categories (e.g., pass/fail, mastery/non-mastery). The classifications may
be used to award diplomas, licenses, or certificates, or to monitor student
performance in objective-based instructional programs.

PROGRAM EVALUATION - CRTs are often used in curriculum or program
evaluation studies. Average scores on each objective or groups of
objectives are reported for groups (and subgroups) of examinees
administered the test.


Which of the following categorizes your CRT score use?

1. descriptions

2.  decisions 

3.  program  evaluation 

4.  unknown 

Question-3 (USE_AS_JUDGE_LEVEL_OF_PROFICIENCY):

Would you use the test results to judge the proficiency of individual
students?

1. yes

2. no

Question-4 (USE_AS_JUDGE_LEVEL_OF_PERFORMANCE):

Would you use the test results to judge the performance of individual
students?

1. yes

2. no

Question-5 (USE_AS_COMPARING_SUBGROUP):

Would  you  use  the  test  results  to  compare  among  subgroups? 

1. yes 

2.  no 

Question-6 (USER_SELECTED_USE_SUBCATEGORY):

There are two different situations where reliability information is valued:

In the test development process - reliability information collected during a
field test can be invaluable in advising on desirable test lengths and in
judging the soundness of the test items.

As one of the criteria used to evaluate an intended test use - reliability
information influences the confidence that users have regarding the test
scores and related decisions.

Which of the following situations best applies to your test?

1. in the test development process

2. evaluation of intended test use

3.  unknown 
Question-7 (SERIOUSNESS_OF_MISCLASSIFICATION):

Are all misclassifications of test scores approximately equal in their
impact?

1.  yes 

2.  no 

3.  unknown 

Question-8  (USER-OCCUPATION): 

Which  of  the  following  best  describes  your  occupation  as  a  test 
developer/user? 

1.  classroom  teacher 

2.  district  level  test  maker 

3.  state  level  test  maker 
4. test publisher

Question-9 (KNOWS_CONCEPT_OF_RELIABILITY):

Do  you  have  the  general  knowledge  of  test  reliability? 

1.  yes 

2.  no 

3.  unknown 

Question-10 (HISTORICAL_TECHNICAL_DATA_EXISTS):

Were any technical data collected on the test before?

1. yes

2. no

3. unknown

Question-11 (USER_SPECIFIED_TECHNICAL_DATA):

What sort of technical data have been collected?

1. decision consistency estimate

2. kappa

3. both of the above

Question-12 (USER_SELECTED_MULTIPLE_ADMINISTRATION):

Is it possible to administer the test more than once?

1. yes

2.  no 

3.  unknown 
Question-13 (STUDENT_MAY_REMEMBER_TEST):

Will  students  be  able  to  remember  questions  or  aspects  of  the  questions 
such  as  the  passages,  or  diagrams  in  the  test? 

1. yes 

2.  no 

Question-14 (LEARNING_BETWEEN_ADMINISTRATIONS):

If the test is administered twice, will some learning take place between the
two administrations?

1.  yes 

2.  no 

Question-15 (USER_CONFIRMED_MULTIPLE_ADMINISTRATION):

From the information you gave, I think that it is possible to administer the
test twice. Since strong assumptions must be made to obtain a reliability
estimate from a single administration, I would recommend that you use a
two-administration design.

Do you think it is practical in your situation to administer the test twice?

1. yes 

2.  no 

Question-16 (PARALLEL_FORM_AVAILABLE):

Are  there  parallel  or  equivalent  forms  available  for  the  test? 

1. yes 

2.  no 

3.  unknown 

Question-17 (HOW_RELIABILITY_ASSESSED):

Since parallel forms cannot be administered, modifications are needed to
the reliability design. Reliability can be assessed in two ways:

(1) creating several forms, where each form includes parallel items to
measure a fraction of the objectives; in this way, the reliability of all
objective scores can be assessed but with a fraction of the total examinee
pool.

(2) assessing the reliability of a representative sampling of objectives and
then generalizing the findings to describe all objectives. Thus,
single-administration estimates of reliability can be computed for the
remaining objectives.

How do you want the reliability to be assessed?

1.  with  all  objectives 
2.  with  sample  objectives 